Load a Sheet, assume there's no logic, we'll add the logic

You'll have run jupyter notebook at the command line or via the Windows Anaconda tool.

In this demo we add logic in place of Excel cell functions and export a finished sheet for comparison.


In [18]:
import pandas as pd

df = pd.read_excel("sheet_1_without_simple_logic.xls")
print(df)
# note the NaN (not-a-number) cells when we have no value


   Feature1  Feature2  DecisionF1F2         TVShow  Decision2  Decision3  \
0       0.6       0.6           NaN      Hollyoaks        NaN        NaN   
1       0.4       0.6           NaN      hollyoaks        NaN        NaN   
2       0.3       0.4           NaN     Hollyoaks         NaN        NaN   
3       0.9       0.8           NaN      hollyoaks        NaN        NaN   
4       0.9       0.8           NaN     holly-oaks        NaN        NaN   
5       0.9       0.8           NaN  best TV shows        NaN        NaN   

                    Comment  
0                       NaN  
1                       NaN  
2  trailing-space on TVShow  
3                       NaN  
4        badly-spelt TVShow  
5                       NaN  

In [19]:
df.head(10) # this creates a Table view (non-interactive but prettier)


Out[19]:
Feature1 Feature2 DecisionF1F2 TVShow Decision2 Decision3 Comment
0 0.6 0.6 NaN Hollyoaks NaN NaN NaN
1 0.4 0.6 NaN hollyoaks NaN NaN NaN
2 0.3 0.4 NaN Hollyoaks NaN NaN trailing-space on TVShow
3 0.9 0.8 NaN hollyoaks NaN NaN NaN
4 0.9 0.8 NaN holly-oaks NaN NaN badly-spelt TVShow
5 0.9 0.8 NaN best TV shows NaN NaN NaN

In [20]:
print("Column names:", df.columns)


Column names: Index(['Feature1', 'Feature2', 'DecisionF1F2', 'TVShow', 'Decision2',
       'Decision3', 'Comment'],
      dtype='object')

In [21]:
print("Information about each row including data types:")
print("(note - type 'object' is catch-all that includes strings)")
df.info()


Information about each row including data types:
(note - type 'object' is catch-all that includes strings)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 7 columns):
Feature1        6 non-null float64
Feature2        6 non-null float64
DecisionF1F2    0 non-null float64
TVShow          6 non-null object
Decision2       0 non-null float64
Decision3       0 non-null float64
Comment         2 non-null object
dtypes: float64(5), object(2)
memory usage: 384.0+ bytes

In [22]:
print("\nWe can extract a column of data as a Series object:")
print(df['Feature1'])


We can extract a column of data as a Series object:
0    0.6
1    0.4
2    0.3
3    0.9
4    0.9
5    0.9
Name: Feature1, dtype: float64

In [23]:
row = df.ix[0]
print("\nWe can extract a row as a Python dictionary:")
print(row)


We can extract a row as a Python dictionary:
Feature1              0.6
Feature2              0.6
DecisionF1F2          NaN
TVShow          Hollyoaks
Decision2             NaN
Decision3             NaN
Comment               NaN
Name: 0, dtype: object

In [24]:
print("\nRow items, e.g. Feature1={feature1}".format(feature1=row['Feature1']))


Row items, e.g. Feature1=0.6000000000000001

In [25]:
def decision_f1_f2(row):
    feature_1 = row['Feature1']
    feature_2 = row['Feature2']
    if feature_1 > 0.5:
        if feature_2 > 0.5:
            return True
    return False

# we'll use apply on the entire DataFrame, axis=1 means row-wise (not column-wise)
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
df.apply(decision_f1_f2, axis=1)
# note this doesn't change the DataFrame, it generates a new separate Series
# and here we just print it and then discard it


Out[25]:
0     True
1    False
2    False
3     True
4     True
5     True
dtype: bool

In [27]:
df.head(10) # confirm we haven't modified the DataFrame


Out[27]:
Feature1 Feature2 DecisionF1F2 TVShow Decision2 Decision3 Comment
0 0.6 0.6 NaN Hollyoaks NaN NaN NaN
1 0.4 0.6 NaN hollyoaks NaN NaN NaN
2 0.3 0.4 NaN Hollyoaks NaN NaN trailing-space on TVShow
3 0.9 0.8 NaN hollyoaks NaN NaN NaN
4 0.9 0.8 NaN holly-oaks NaN NaN badly-spelt TVShow
5 0.9 0.8 NaN best TV shows NaN NaN NaN

In [29]:
# we can assign the result back to the DataFrame as a new column
new_result = df.apply(decision_f1_f2, axis=1)
df['DecisionF1F2'] = new_result
df.head(10)


Out[29]:
Feature1 Feature2 DecisionF1F2 TVShow Decision2 Decision3 Comment
0 0.6 0.6 True Hollyoaks NaN NaN NaN
1 0.4 0.6 False hollyoaks NaN NaN NaN
2 0.3 0.4 False Hollyoaks NaN NaN trailing-space on TVShow
3 0.9 0.8 True hollyoaks NaN NaN NaN
4 0.9 0.8 True holly-oaks NaN NaN badly-spelt TVShow
5 0.9 0.8 True best TV shows NaN NaN NaN

In [31]:
def decision_tvshow(cell):
    cleaned_cell = cell
    cleaned_cell = cleaned_cell.lower() # lowercase
    is_hollyoaks = cleaned_cell == "hollyoaks"
    return is_hollyoaks

# we can assign the result back to the DataFrame as a new column
df['Decision2'] = df['TVShow'].apply(decision_tvshow)
df.head(10)


Out[31]:
Feature1 Feature2 DecisionF1F2 TVShow Decision2 Decision3 Comment
0 0.6 0.6 True Hollyoaks True NaN NaN
1 0.4 0.6 False hollyoaks True NaN NaN
2 0.3 0.4 False Hollyoaks False NaN trailing-space on TVShow
3 0.9 0.8 True hollyoaks True NaN NaN
4 0.9 0.8 True holly-oaks False NaN badly-spelt TVShow
5 0.9 0.8 True best TV shows False NaN NaN

Add a strip to remove whitespace from the "Hollyoaks " example


In [32]:
def decision_tvshow_is_hollyoaks(cell):
    cleaned_cell = cell
    cleaned_cell = cleaned_cell.lower() # lowercase
    cleaned_cell = cleaned_cell.strip() # remove superflous whitespace
    is_hollyoaks = cleaned_cell == "hollyoaks"
    return is_hollyoaks

# we can assign the result back to the DataFrame as a new column
df['Decision2'] = df['TVShow'].apply(decision_tvshow_is_hollyoaks)
df.head(10)


Out[32]:
Feature1 Feature2 DecisionF1F2 TVShow Decision2 Decision3 Comment
0 0.6 0.6 True Hollyoaks True NaN NaN
1 0.4 0.6 False hollyoaks True NaN NaN
2 0.3 0.4 False Hollyoaks True NaN trailing-space on TVShow
3 0.9 0.8 True hollyoaks True NaN NaN
4 0.9 0.8 True holly-oaks False NaN badly-spelt TVShow
5 0.9 0.8 True best TV shows False NaN NaN

Combine DecisionF1F2 and Decision2


In [34]:
# use a different way to access the columns (using .colname - this only works for
# easy-to-read ASCII names, it won't work with funny characters)
# and use the logical and (&) to do pairwise logic, assigning the result to our new column
df['Decision3'] = df.DecisionF1F2 & df.Decision2
df.head(10)


Out[34]:
Feature1 Feature2 DecisionF1F2 TVShow Decision2 Decision3 Comment
0 0.6 0.6 True Hollyoaks True True NaN
1 0.4 0.6 False hollyoaks True False NaN
2 0.3 0.4 False Hollyoaks True False trailing-space on TVShow
3 0.9 0.8 True hollyoaks True True NaN
4 0.9 0.8 True holly-oaks False False badly-spelt TVShow
5 0.9 0.8 True best TV shows False False NaN

Output a new XLS


In [35]:
writer = pd.ExcelWriter('sheet_1_with_added_logic_generated_via_pandas.xlsx', engine='xlsxwriter')
df.to_excel(writer, index=False, sheet_name='Our New Sheet')
workbook = writer.book
worksheet = writer.sheets['Our New Sheet']
writer.save()
# note we could add lots of conditional formatting for Excel via:
# http://pbpython.com/improve-pandas-excel-output.html

Let's be clever - we'll do some text processing

We'll try the external fuzzywuzzy library to do some very simple text processing.

If needed install using conda install -c wpb fuzzywuzzy.


In [47]:
from fuzzywuzzy import fuzz
# http://pbpython.com/excel-pandas-comp.html side reading on fuzzywuzzy
# http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/ other fuzzywuzzy metrics for 
# other ways of measuring similarity

In [37]:
# quick demo:
item1 = "hollyoaks"
target = "hollyoaks"
print(target, fuzz.ratio(item1, target))
target = 'holly-oaks'
print(target, fuzz.ratio(item1, target))
target = 'holly oak'
print(target, fuzz.ratio(item1, target))
target = "tv's best moments"
print(target, fuzz.ratio(item1, target))


hollyoaks 100
holly-oaks 95
holly oak 89
tv's best moments 15

In [41]:
def decision_tvshow_is_hollyoaks_with_smarts(cell):
    cleaned_cell = cell
    cleaned_cell = cleaned_cell.lower() # lowercase
    cleaned_cell = cleaned_cell.strip() # remove superflous whitespace
    is_hollyoaks =  fuzz.ratio("hollyoaks", cleaned_cell) > 85
    return is_hollyoaks

# we can assign the result back to the DataFrame as a new column
df['TVShow'].apply(decision_tvshow_is_hollyoaks_with_smarts)


Out[41]:
0     True
1     True
2     True
3     True
4     True
5    False
Name: TVShow, dtype: bool

In [42]:
df.head(10) # show that we *haven't* yet manipulated the DataFrame


Out[42]:
Feature1 Feature2 DecisionF1F2 TVShow Decision2 Decision3 Comment
0 0.6 0.6 True Hollyoaks True True NaN
1 0.4 0.6 False hollyoaks True False NaN
2 0.3 0.4 False Hollyoaks True False trailing-space on TVShow
3 0.9 0.8 True hollyoaks True True NaN
4 0.9 0.8 True holly-oaks False False badly-spelt TVShow
5 0.9 0.8 True best TV shows False False NaN

In [45]:
df['Decision2'] = df['TVShow'].apply(decision_tvshow_is_hollyoaks_with_smarts)
df.head(10)
# note that we *haven't re-run Decision3*!


Out[45]:
Feature1 Feature2 DecisionF1F2 TVShow Decision2 Decision3 Comment
0 0.6 0.6 True Hollyoaks True True NaN
1 0.4 0.6 False hollyoaks True False NaN
2 0.3 0.4 False Hollyoaks True False trailing-space on TVShow
3 0.9 0.8 True hollyoaks True True NaN
4 0.9 0.8 True holly-oaks True False badly-spelt TVShow
5 0.9 0.8 True best TV shows False False NaN

In [46]:
df['Decision3'] = df.DecisionF1F2 & df.Decision2
df.head(10)


Out[46]:
Feature1 Feature2 DecisionF1F2 TVShow Decision2 Decision3 Comment
0 0.6 0.6 True Hollyoaks True True NaN
1 0.4 0.6 False hollyoaks True False NaN
2 0.3 0.4 False Hollyoaks True False trailing-space on TVShow
3 0.9 0.8 True hollyoaks True True NaN
4 0.9 0.8 True holly-oaks True True badly-spelt TVShow
5 0.9 0.8 True best TV shows False False NaN

In [44]:
# We could copy the Excel code down here and write out a new spreadsheet to disk...

In [ ]: